Goals

In this file, we will perform compositional analysis of our scaled/noramlized/rarefied microbial dataset!

  1. Load in scaled/normalized phyloseq object.
  2. Calculate the relative abundances of taxonomic groups at various levels: A. Phylum B. Genus C. ASV
  3. Plot them, and narrow in on specific taxnomic group of interest

Inputs

  1. We will need the scaled_physeq.RData, which includes a rooted tree that we created in analysis/06_Ordination/scaled_physeq.RData.

Outputs

  1. Beautiful visualizations of microbial taxa and how they vary across parameters related to our study: station (categorical) and salinity (continuous).
  2. Run some stats!

Compositional Data

Microbial abundance data—like 16S rRNA gene or metagenomics data—are typically compositional: they represent relative abundances constrained to a constant total (e.g., percent or proportions). This introduces spurious correlations and other issues if analyzed with traditional statistics. This is a very important limitation to microbial data!

Interpretation 1: Interpreting microbial abundance from relative (aka. compositional) rather than absolute counts does not allow data to be compared between studies. Additionally, since all abundances are relative, when one abundance increases another one must decrease, even if the absolute abundance did not actually change. This can introduce false positives.

Load Packages

pacman::p_load(tidyverse, devtools, DT, phyloseq, patchwork, install = FALSE)

# load colors
source("code/colors.R")

1. Load Data

The following phyloseq object contains microbial community composition data in a standardized format. In this case, we’ve already normalized the reads (scaled to 17,323 per sample), which is essential for comparing relative abundances.

#load physeq
load("data/06_Ordination/scaled_physeq.RData")

# look at the data
scaled_physeq
## phyloseq-class experiment-level object
## otu_table()   OTU Table:         [ 6060 taxa and 43 samples ]
## sample_data() Sample Data:       [ 43 samples by 38 sample variables ]
## tax_table()   Taxonomy Table:    [ 6060 taxa by 9 taxonomic ranks ]
## phy_tree()    Phylogenetic Tree: [ 6060 tips and 6059 internal nodes ]
# Intuition check - scaled at 17,323
min(sample_sums(scaled_physeq))
## [1] 17323
range(sample_sums(scaled_physeq))
## [1] 17323 17404

Taxonomic Analysis !

In this analysis, we will drill down from phylum to ASV level analyses, which will enable increasingly detailed insights into microbial diversity and potential ecological roles. However, it is also important for us to remember that deeper levels also come with increased noise and data sparsity, especially for rare groups.

Phylum

Taxonomic analysis often begins at broader levels (e.g., Phylum) to visualize overarching trends before zooming in on finer-scale patterns. This step allows us to identify which microbial groups dominate across samples and which may respond to environmental gradients like salinity.

Now, let’s calculate the relative abundance of the phyla across all the samples. NOTE: To do this, it is imperative that we have scaled the data to a constant sequencing depth.

#Create a phylum level datafram
phylum_df <-
  scaled_physeq %>%
  # Agglomerate all ASV counts within a phylum
  tax_glom(taxrank = "Phylum") %>% # We have 31 Phyla
  # Calculate relative abundance !
  transform_sample_counts(function(x) {x/sum(x)}) %>%
  #create df from phyloseq object
  psmelt() %>%
  # Filter out phyla < 1%
  dplyr::filter(Abundance > 0.01) %>%
  # fix the order of age sample site
  mutate(SampleSite = fct_relevel(SampleSite, c("CLOACA", "ORAL", "TANK WATER")),
         AgeRange = fct_relevel(AgeRange, c("JUVENILE", "SUB-ADULT",
                                          "ADULT")))

## What are the phylum abundances? 
phylum_df %>%
  group_by(Phylum) %>%
  summarize(mean_PercAbund = round(mean(Abundance), digits = 4)) %>%
  arrange(-mean_PercAbund) %>%
  datatable()
# Make a list of the top phyla 
top10_phyla <- 
  phylum_df %>%
  group_by(Phylum) %>%
  summarize(mean_PercAbund = mean(Abundance)) %>%
  arrange(-mean_PercAbund) %>%
  head(n = 10) %>%
  pull(Phylum)

# intuition check - should be the same 
top10_phyla
##  [1] "Pseudomonadota"          "Bacteroidota"           
##  [3] "Bacillota"               "Dependentiae"           
##  [5] "Spirochaetota"           "Balneolota"             
##  [7] "Verrucomicrobiota"       "Thermodesulfobacteriota"
##  [9] "Myxococcota"             "Planctomycetota"

Interpretation 2: Pseudomonadota, Bacteroidota, and Bacillota have the highest relative abundance in my dataset. Their abundaces are 58%, 37%, and 12%, respectively.

Stacked Bar plots

Visualization helps detect patterns in composition and abundance that statistical models may later test. Stacked bar plots are often used but can obscure individual sample variation and visualize too much data all at once.

Therefore, we will also explore faceted and jittered boxplots below the bar plots to see sample-level trends more clearly.

# Stacked Bar Plot With All phyla 
# Plot Phylum Abundances - make sure to load phylum_colors 
phylum_df %>%
  dplyr::filter(Phylum %in% top10_phyla) %>%
  # Warning: It's important to have one sample per x value, 
  # otherwise, it will take the sum between multiple samples
  # chose one turtle for juvenile, sub-adult, and adult, but I don't think this looks right
  dplyr::filter(TurtleName == c("MARVIN", "MARO", "SAMBA")) %>%
  ggplot(aes(x = SampleSite, y = Abundance, fill = Phylum)) + 
  facet_grid(.~AgeRange) + 
  geom_bar(stat = "identity", color = "black") + 
  labs(title = "Top 10 Phyla: Surface Samples") + 
  scale_fill_manual(values = phylum_colors) + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))

#Relative abundance needs to be between 0 and 1, or else you're plotting more than one sample
# Stacked Bar Plot looking at all samples and grouping by age range
# Plot Phylum Abundances - make sure to load phylum_colors 
phylum_df %>%
  dplyr::filter(Phylum %in% top10_phyla) %>%
  ggplot(aes(x = Sample, y = Abundance, fill = Phylum)) + 
  facet_grid(. ~ AgeRange, scales = "free_x", space = "free_x") + 
  geom_bar(stat = "identity", color = "black") + 
  labs(title = "Top 10 Phyla per Turtle") + 
  scale_fill_manual(values = phylum_colors) + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
        strip.background = element_blank(),
        panel.spacing = unit(1, "lines"))

# Stacked Bar Plot looking at all samples and grouping by sample site 
# Plot Phylum Abundances - make sure to load phylum_colors 
phylum_df %>%
  dplyr::filter(Phylum %in% top10_phyla) %>%
  ggplot(aes(x = Sample, y = Abundance, fill = Phylum)) + 
  facet_grid(. ~ SampleSite, scales = "free_x", space = "free_x") + 
  geom_bar(stat = "identity", color = "black") + 
  labs(title = "Top 10 Phyla per Turtle") + 
  scale_fill_manual(values = phylum_colors) + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
        strip.background = element_blank(),
        panel.spacing = unit(1, "lines"))

# Stacked Bar Plot looking at all samples and grouping by age AND sample site
# Plot Phylum Abundances - make sure to load phylum_colors 
phylum_df %>%
  filter(Phylum %in% top10_phyla) %>%
  ggplot(aes(x = Sample, y = Abundance, fill = Phylum)) + 
  facet_grid(SampleSite ~ AgeRange, scales = "free_x", space = "free_x") + 
  geom_bar(stat = "identity", color = "black") + 
  labs(title = "Top 10 Phyla per Turtle by Sample Site and Age Range") + 
  scale_fill_manual(values = phylum_colors) + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
        strip.background = element_blank(),
        panel.spacing = unit(1, "lines"))

Faceted Bar plot

To help compare the phylum abundance between sample types, we can facet by phylum to better see how the changes occur across ages, which is masked in the stacked bar plot. It’s a little better than the stacked bar plot, however, we can do even better!

# Group by Age 
phylum_df %>%
  dplyr::filter(Phylum %in% top10_phyla) %>%
  ggplot(aes(x = Sample, y = Abundance, fill = Phylum)) + 
  facet_grid(Phylum~AgeRange, scale = "free") + 
  # add the stacked bar 
  geom_bar(stat = "identity", color = "black") + 
  # change the colors to be our selected colors 
  scale_fill_manual(values = phylum_colors) + 
  theme_bw() + 
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))

# Group by Sample Site
phylum_df %>%
  dplyr::filter(Phylum %in% top10_phyla) %>%
  ggplot(aes(x = Sample, y = Abundance, fill = Phylum)) + 
  facet_grid(Phylum~SampleSite, scale = "free") + 
  # add the stacked bar 
  geom_bar(stat = "identity", color = "black") + 
  # change the colors to be our selected colors 
  scale_fill_manual(values = phylum_colors) + 
  theme_bw() + 
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))

### Or combined together: 
phylum_df %>%
  dplyr::filter(Phylum %in% top10_phyla) %>%
  ggplot(aes(x = AgeRange, y = Abundance, fill = Phylum, color = Phylum)) + 
  facet_grid(Phylum~SampleSite, scale = "free") + 
  # add the stacked bar 
  geom_jitter() +
  geom_boxplot(outlier.shape = NA, alpha = 0.5) + 
  # change the colors to be our selected colors 
  scale_fill_manual(values = phylum_colors) + 
  scale_color_manual(values = phylum_colors) + 
  theme_bw() + 
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))

All Phyla by Age:

### Or combined together: 
phylum_df %>%
  dplyr::filter(Phylum %in% top10_phyla) %>%
  ggplot(aes(x = AgeRange, y = Abundance, fill = Phylum, color = Phylum)) + 
  facet_wrap(Phylum~., scales = "free", nrow = 2) + 
  # add the stacked bar 
  geom_jitter() +
  geom_boxplot(outlier.shape = NA, alpha = 0.5) + 
  # change the colors to be our selected colors 
  scale_fill_manual(values = phylum_colors) + 
  scale_color_manual(values = phylum_colors) + 
  theme_bw() + 
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))

Cloaca Phyla by Age:

phylum_df %>%
  dplyr::filter(Phylum %in% top10_phyla) %>%
  dplyr::filter(SampleSite == "CLOACA") %>%
  ggplot(aes(x = AgeRange, y = Abundance, fill = Phylum, color = Phylum)) + 
  facet_wrap(Phylum~., scales = "free", nrow = 2) + 
  # add the stacked bar 
  geom_jitter() +
  geom_boxplot(outlier.shape = NA, alpha = 0.5) + 
  # change the colors to be our selected colors 
  scale_fill_manual(values = phylum_colors) + 
  scale_color_manual(values = phylum_colors) + 
  theme_bw() + 
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))

Oral Phyla by Age:

phylum_df %>%
  dplyr::filter(Phylum %in% top10_phyla) %>%
  dplyr::filter(SampleSite == "ORAL") %>%
  ggplot(aes(x = AgeRange, y = Abundance, fill = Phylum, color = Phylum)) + 
  facet_wrap(Phylum~., scales = "free", nrow = 2) + 
  # add the stacked bar 
  geom_jitter() +
  geom_boxplot(outlier.shape = NA, alpha = 0.5) + 
  # change the colors to be our selected colors 
  scale_fill_manual(values = phylum_colors) + 
  scale_color_manual(values = phylum_colors) + 
  theme_bw() + 
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))

Tank Water Phyla by Age:

phylum_df %>%
  dplyr::filter(Phylum %in% top10_phyla) %>%
  dplyr::filter(SampleSite == "TANK WATER") %>%
  ggplot(aes(x = AgeRange, y = Abundance, fill = Phylum, color = Phylum)) + 
  facet_wrap(Phylum~., scales = "free", nrow = 2) + 
  # add the stacked bar 
  geom_jitter() +
  geom_boxplot(outlier.shape = NA, alpha = 0.5) + 
  # change the colors to be our selected colors 
  scale_fill_manual(values = phylum_colors) + 
  scale_color_manual(values = phylum_colors) + 
  theme_bw() + 
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))

Phyla by Sample Site:

### Or combined together: 
phylum_df %>%
  dplyr::filter(Phylum %in% top10_phyla) %>%
  ggplot(aes(x = SampleSite, y = Abundance, fill = Phylum, color = Phylum)) + 
  facet_wrap(Phylum~., scales = "free", nrow = 2) + 
  # add the stacked bar 
  geom_jitter() +
  geom_boxplot(outlier.shape = NA, alpha = 0.5) + 
  # change the colors to be our selected colors 
  scale_fill_manual(values = phylum_colors) + 
  scale_color_manual(values = phylum_colors) + 
  theme_bw() + 
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1))

notes - some phyla unique to adults or sub-adults? (Dependentiae, Thermodesulfobacteriota, Balneolota) - some phyla shared between ages (Bacteroidota, Pseudomonadota) - some phyla differ across ages (sub-adult Verrucomicrobiota)

After initial exploration, we focus on specific phyla that appear to vary across ages. These targeted plots help develop hypotheses about ecological drivers.

  • Verrucomicrobiota - paper said this phyla appears more in juveniles
  • Myxococcota
  • Planctomycetota
  • Bacteroidota –> paper said juveniles had more Flavobacteriales/Tenacibaculum and adults had more Bacteriodales
    • Tenacibaculum is marine pathogen

A1. Verrucomicrobiota

# Narrow in on a specific group
# Verrucomicrobiota - y: abundance, x: age, dot plot + boxplot
verrucom_phylum_AgeRange <- 
  phylum_df %>%
  dplyr::filter(Phylum == "Verrucomicrobiota") %>%
  # build the plot 
  ggplot(aes(x = AgeRange, y = Abundance, 
             fill = AgeRange, color = AgeRange)) + 
  geom_boxplot(alpha = 0.5, outlier.shape = NA) + # outliers not plotted here in boxplot 
  geom_jitter() + 
  theme_bw() + 
  labs(title = "Verrucomicrobiota Phylum") + 
  scale_color_manual(values = AgeRange_colors) + 
  scale_fill_manual(values = AgeRange_colors) + 
    theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
          legend.position = "right")

# Separate by Sample Site
phylum_df %>%
  filter(Phylum == "Verrucomicrobiota") %>%
  ggplot(aes(x = AgeRange, y = Abundance, fill = AgeRange, color = AgeRange)) + 
  geom_boxplot(alpha = 0.5, outlier.shape = NA) + 
  geom_jitter(width = 0.2, alpha = 0.6) + 
  facet_wrap(~ SampleSite) + 
  #stat_compare_means(method = "kruskal.test", label = "p.format") + 
  scale_color_manual(values = AgeRange_colors) + 
  scale_fill_manual(values = AgeRange_colors) + 
  labs(title = "Verrucomicrobiota Phylum by Sample Site and Age Range") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
        legend.position = "right")

# Statistically: Kruskall-Wallis followed by a Tukey's Posthoc test
# These are non-parametric (non-normal) stat tests 

# CONTINUOUS 
verrucom_phylum_weight <- 
  phylum_df %>%
  dplyr::filter(Phylum == "Verrucomicrobiota") %>%
  ggplot(aes(x = Weight, y = Abundance)) + 
  geom_point(aes(color = AgeRange)) + 
  theme_bw() + 
  geom_smooth(method = "lm",  formula = y ~ poly(x, 2)) + 
  labs(title = "Verrucomicrobiota Phylum") + 
  scale_color_manual(values = AgeRange_colors) + 
  theme(legend.position = "none")

# Collect both of the plots together into one 
verrucom_phylum_AgeRange + verrucom_phylum_weight + 
  plot_layout(guides = "collect") +
  plot_annotation(tag_levels = "A")

Def more in juveniles when looking at Cloaca Possibly more in sub-adults when looking at Oral

A2. Myxococcota

# Narrow in on a specific group
# Myxococcota - y: abundance, x: age, dot plot + boxplot
myxoc_phylum_AgeRange <- 
  phylum_df %>%
  dplyr::filter(Phylum == "Myxococcota") %>%
  # build the plot 
  ggplot(aes(x = AgeRange, y = Abundance, 
             fill = AgeRange, color = AgeRange)) + 
  geom_boxplot(alpha = 0.5, outlier.shape = NA) + # outliers not plotted here in boxplot 
  geom_jitter() + 
  theme_bw() + 
  labs(title = "Myxococcota Phylum") + 
  scale_color_manual(values = AgeRange_colors) + 
  scale_fill_manual(values = AgeRange_colors) + 
    theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
          legend.position = "right")
# Statistically: Kruskall-Wallis followed by a Tukey's Posthoc test
# These are non-parametric (non-normal) stat tests 

# Separate by Sample Site and test with KW
phylum_df %>%
  filter(Phylum == "Myxococcota") %>%
  ggplot(aes(x = AgeRange, y = Abundance, fill = AgeRange, color = AgeRange)) + 
  geom_boxplot(alpha = 0.5, outlier.shape = NA) + 
  geom_jitter(width = 0.2, alpha = 0.6) + 
  facet_wrap(~ SampleSite) + 
  #stat_compare_means(method = "kruskal.test", label = "p.format") + 
  scale_color_manual(values = AgeRange_colors) + 
  scale_fill_manual(values = AgeRange_colors) + 
  labs(title = "Myxococcota Phylum by Sample Site and Age Range") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
        legend.position = "right")

# CONTINUOUS 
myxoc_phylum_weight <- 
  phylum_df %>%
  dplyr::filter(Phylum == "Myxococcota") %>%
  ggplot(aes(x = Weight, y = Abundance)) + 
  geom_point(aes(color = AgeRange)) + 
  theme_bw() + 
  geom_smooth(method = "lm",  formula = y ~ poly(x, 2)) + 
  labs(title = "Myxococcota Phylum") + 
  scale_color_manual(values = AgeRange_colors) + 
  theme(legend.position = "none")

# Collect both of the plots together into one 
myxoc_phylum_AgeRange + myxoc_phylum_weight + 
  plot_layout(guides = "collect") +
  plot_annotation(tag_levels = "A")

lower in sub-adults?

A3. Planctomycetota

# Narrow in on a specific group
# Planctomycetota - y: abundance, x: age, dot plot + boxplot
plant_phylum_AgeRange <- 
  phylum_df %>%
  dplyr::filter(Phylum == "Planctomycetota") %>%
  # build the plot 
  ggplot(aes(x = AgeRange, y = Abundance, 
             fill = AgeRange, color = AgeRange)) + 
  geom_boxplot(alpha = 0.5, outlier.shape = NA) + # outliers not plotted here in boxplot 
  geom_jitter() + 
  theme_bw() + 
  labs(title = "Planctomycetota Phylum") + 
  scale_color_manual(values = AgeRange_colors) + 
  scale_fill_manual(values = AgeRange_colors) + 
    theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
          legend.position = "right")
# Statistically: Kruskall-Wallis followed by a Tukey's Posthoc test
# These are non-parametric (non-normal) stat tests 

# Separate by Sample Site and test with KW
phylum_df %>%
  filter(Phylum == "Planctomycetota") %>%
  ggplot(aes(x = AgeRange, y = Abundance, fill = AgeRange, color = AgeRange)) + 
  geom_boxplot(alpha = 0.5, outlier.shape = NA) + 
  geom_jitter(width = 0.2, alpha = 0.6) + 
  facet_wrap(~ SampleSite) + 
  #stat_compare_means(method = "kruskal.test", label = "p.format") + 
  scale_color_manual(values = AgeRange_colors) + 
  scale_fill_manual(values = AgeRange_colors) + 
  labs(title = "Planctomycetota Phylum by Sample Site and Age Range") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
        legend.position = "right")

# CONTINUOUS 
plant_phylum_weight <- 
  phylum_df %>%
  dplyr::filter(Phylum == "Planctomycetota") %>%
  ggplot(aes(x = Weight, y = Abundance)) + 
  geom_point(aes(color = AgeRange)) + 
  theme_bw() + 
  geom_smooth(method = "lm",  formula = y ~ poly(x, 2)) + 
  labs(title = "Planctomycetota Phylum") + 
  scale_color_manual(values = AgeRange_colors) + 
  theme(legend.position = "none")

# Collect both of the plots together into one 
plant_phylum_AgeRange + plant_phylum_weight + 
  plot_layout(guides = "collect") +
  plot_annotation(tag_levels = "A")

not rly different

A4. Bacteroidota

# Narrow in on a specific group
# Bacteroidota - y: abundance, x: age, dot plot + boxplot
bacter_phylum_AgeRange <- 
  phylum_df %>%
  dplyr::filter(Phylum == "Bacteroidota") %>%
  # build the plot 
  ggplot(aes(x = AgeRange, y = Abundance, 
             fill = AgeRange, color = AgeRange)) + 
  geom_boxplot(alpha = 0.5, outlier.shape = NA) + # outliers not plotted here in boxplot 
  geom_jitter() + 
  theme_bw() + 
  labs(title = "Bacteroidota Phylum") + 
  scale_color_manual(values = AgeRange_colors) + 
  scale_fill_manual(values = AgeRange_colors) + 
    theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
          legend.position = "right")
# Statistically: Kruskall-Wallis followed by a Tukey's Posthoc test
# These are non-parametric (non-normal) stat tests 

# Separate by Sample Site and test with KW
phylum_df %>%
  filter(Phylum == "Bacteroidota") %>%
  ggplot(aes(x = AgeRange, y = Abundance, fill = AgeRange, color = AgeRange)) + 
  geom_boxplot(alpha = 0.5, outlier.shape = NA) + 
  geom_jitter(width = 0.2, alpha = 0.6) + 
  facet_wrap(~ SampleSite) + 
  #stat_compare_means(method = "kruskal.test", label = "p.format") + 
  scale_color_manual(values = AgeRange_colors) + 
  scale_fill_manual(values = AgeRange_colors) + 
  labs(title = "Bacteroidota Phylum by Sample Site and Age Range") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
        legend.position = "right")

# CONTINUOUS 
bacter_phylum_weight <- 
  phylum_df %>%
  dplyr::filter(Phylum == "Bacteroidota") %>%
  ggplot(aes(x = Weight, y = Abundance)) + 
  geom_point(aes(color = AgeRange)) + 
  theme_bw() + 
  geom_smooth(method = "lm",  formula = y ~ poly(x, 2)) + 
  labs(title = "Bacteroidota Phylum") + 
  scale_color_manual(values = AgeRange_colors) + 
  theme(legend.position = "none")

# Collect both of the plots together into one 
bacter_phylum_AgeRange + bacter_phylum_weight + 
  plot_layout(guides = "collect") +
  plot_annotation(tag_levels = "A")

Not rly a difference

Interpretation 4: I would most want to look into Verrucomicrobiota and Bacteroidota.

B. Genus

Let’s first calculate the genus data frame.

# Calculate the Family relative abundance 
# Note: The read depth MUST be normalized in some way: scale_reads
genus_df <- 
  scaled_physeq %>%
  # agglomerate at the phylum level 
  tax_glom(taxrank = "Genus") %>% 
  # Transform counts to relative abundance 
  transform_sample_counts(function (x) {x/sum(x)}) %>%
  # Melt to a long format 
  psmelt() %>%
  # fix the order of sample site
  mutate(SampleSite = fct_relevel(SampleSite, c("CLOACA", "ORAL", "TANK WATER")),
         AgeRange = fct_relevel(AgeRange, c("JUVENILE", "SUB-ADULT",
                                          "ADULT")))

B1. Verrucomicrobiota Genera

# Verrucomicrobiota
# Plot genus 
verr_genus_AgeRange <- 
  genus_df %>%
  dplyr::filter(Phylum == "Verrucomicrobiota") %>%
  dplyr::filter(SampleSite == "CLOACA") %>%
  # At first, plot all of the genera and then subset the ones that have intersting trends
  dplyr::filter(Genus %in% c("Rubritalea", "Diplosphaera")) %>%
  # build the plot 
  ggplot(aes(x = AgeRange, y = Abundance, 
             fill = AgeRange, color = AgeRange)) + 
  facet_wrap(.~Genus, scales = "free_y", nrow = 1) + 
  geom_boxplot(alpha = 0.5, outlier.shape = NA) + # outliers not plotted here in boxplot 
  geom_jitter() + 
  theme_bw() + 
  labs(title = "Verrucomicrobiota Genera in Cloaca") + 
  scale_color_manual(values = AgeRange_colors) + 
  scale_fill_manual(values = AgeRange_colors) + 
  theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
        legend.position = "none")

# Plot genus: Continuous 
verr_genus_weight <- 
  genus_df %>%
  dplyr::filter(Phylum == "Verrucomicrobiota") %>%
  dplyr::filter(SampleSite == "CLOACA") %>%
  dplyr::filter(Genus %in% c("Rubritalea", "Diplosphaera")) %>%
  # build the plot 
  ggplot(aes(x = Weight, y = Abundance)) + 
  facet_wrap(.~Genus, scales = "free_y", nrow = 1) + 
  geom_point(aes(color = AgeRange)) +  theme_bw() + 
  geom_smooth(method = "lm",  formula = y ~ poly(x, 2)) + 
  labs(title = "Verrucomicrobiota Genera in Cloaca") + 
  scale_color_manual(values = AgeRange_colors) + 
  theme(legend.position = "none")

# Collect the Actinomycetota Plots
verr_genus_AgeRange / verr_genus_weight + 
  plot_layout(guides = "collect") +
  plot_annotation(tag_levels = "A")

There’s rly no difference

B2. Bacteroidota Genera

# Bacteroidota
# Plot genus 
bacter_genus_AgeRange <- 
  genus_df %>%
  dplyr::filter(Phylum == "Bacteroidota") %>%
  dplyr::filter(SampleSite == "CLOACA") %>%
  # At first, plot all of the genera and then subset the ones that have intersting trends
  dplyr::filter(Genus %in% c("Tenacibaculum", "Bacteroides")) %>%
  # build the plot 
  ggplot(aes(x = AgeRange, y = Abundance, 
             fill = AgeRange, color = AgeRange)) + 
  facet_wrap(.~Genus, scales = "free_y", nrow = 1) + 
  geom_boxplot(alpha = 0.5, outlier.shape = NA) + # outliers not plotted here in boxplot 
  geom_jitter() + 
  theme_bw() + 
  labs(title = "Bacteroidota Genera in Cloaca") + 
  scale_color_manual(values = AgeRange_colors) + 
  scale_fill_manual(values = AgeRange_colors) + 
  theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
        legend.position = "none")

# Plot genus: Continuous 
bacter_genus_weight <- 
  genus_df %>%
  dplyr::filter(Phylum == "Bacteroidota") %>%
  dplyr::filter(SampleSite == "CLOACA") %>%
  dplyr::filter(Genus %in% c("Tenacibaculum", "Bacteroides")) %>%
  # build the plot 
  ggplot(aes(x = Weight, y = Abundance)) + 
  facet_wrap(.~Genus, scales = "free_y", nrow = 1) + 
  geom_point(aes(color = AgeRange)) +  theme_bw() + 
  geom_smooth(method = "lm",  formula = y ~ poly(x, 2)) + 
  labs(title = "Bacteroidota Genera in Cloaca") + 
  scale_color_manual(values = AgeRange_colors) + 
  theme(legend.position = "none")

# Collect the Bacteroidota Plots
bacter_genus_AgeRange / bacter_genus_weight + 
  plot_layout(guides = "collect") +
  plot_annotation(tag_levels = "A")

C. ASV level

Now, let’s take a look at the ASVs. This is where the real ecology/biology happens! This is because the ASV-level plots will provide us with a more detailed view of which specific taxa (ASVs) are driving the overall trends seen at the higher taxonomic levels (e.g. phylum) in relation to salinity gradients and station differences.

Before we calculate the ASV-level abundances, let’s take a second to think about who we’d like to consider. There’s a lot of ASVs! In fact, there are 6060! It is difficult analyze all of them! Therefore, we will remove ASVs that have an overall count across the entire dataset of 1,732 or 1% of a scaled sample to 17,323. This is a little arbitrary and therefore, you may choose a different threshold for your dataset. The goal here is to dramatically decreases the number of ASVs in the dataset to help make the ASV-level data analysis much easier.

Of course, if we had more time in class, we could run differential abundance and this would also help us to identify and statistically test which ASVs are important and also help us narrow on specific ASVs.

The goal with this lesson is to do some data exploration and then also test ASVs that are relevant for our specific study. Let’s go!

# Calculate the Family relative abundance 
# Note: The read depth MUST be normalized in some way: scale_reads
ASV_df <- 
  scaled_physeq %>%
  # Prune out ASVs that have fewer than 100 counts! 
  ## LOOK AT HOW MANY ARE REMOVED! We scaled to 17,323 reads! :O
  prune_taxa(taxa_sums(.) >= 1732, .) %>%
  # agglomerate at the phylum level 
  tax_glom(taxrank = "ASV") %>% 
  # Transform counts to relative abundance 
  transform_sample_counts(function (x) {x/sum(x)}) %>%
  # Melt to a long format 
  psmelt() %>%
  # fix the order of age sample site
  mutate(SampleSite = fct_relevel(SampleSite, c("CLOACA", "ORAL", "TANK WATER")),
         AgeRange = fct_relevel(AgeRange, c("JUVENILE", "SUB-ADULT",
                                          "ADULT")))

C1. Bacillota ASVs

There are only Bacillota and Bacteroidota in this dataframe, so I cannot investigate the others.

# Calculate top couple of ASVs 
# Make a list of phyla the top phyla 
top_bacil_ASVs <- 
  ASV_df %>%
  dplyr::filter(Phylum == "Bacillota") %>%
  group_by(ASV) %>%
  summarize(mean_Abundance = mean(Abundance)) %>%
  dplyr::filter(mean_Abundance > 0.005) %>%
  pull(ASV)

# Bacillota
# Plot ASVs 
bacil_asv_AgeRange <- 
  ASV_df %>%
  dplyr::filter(ASV %in% top_bacil_ASVs) %>%
  # build the plot 
  ggplot(aes(x = AgeRange, y = Abundance, 
             fill = AgeRange, color = AgeRange)) + 
  facet_wrap(Genus~ASV, scales = "free_y", nrow = 2) + 
  geom_boxplot(alpha = 0.5, outlier.shape = NA) + # outliers not plotted here in boxplot 
  geom_jitter() + 
  theme_bw() + 
  labs(title = "Bacillota ASVs") + 
  scale_color_manual(values = AgeRange_colors) + 
  scale_fill_manual(values = AgeRange_colors) + 
  theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
        legend.position = "none")

# Plot ASVs: Continuous 
bacil_asv_weight <- 
  ASV_df %>%
  dplyr::filter(ASV %in% top_bacil_ASVs) %>%
  # build the plot 
  ggplot(aes(x = Weight, y = Abundance)) + 
  facet_wrap(Genus~ASV, scales = "free_y", nrow = 2) + 
  geom_point(aes(color = AgeRange)) +  theme_bw() + 
  geom_smooth(method = "lm",  formula = y ~ poly(x, 2)) + 
  labs(title = "Bacillota ASVs") + 
  scale_color_manual(values = AgeRange_colors) + 
  theme(legend.position = "none")

# Collect the Bacillota Plots
bacil_asv_AgeRange / bacil_asv_weight + 
  plot_layout(guides = "collect") +
  plot_annotation(tag_levels = "A")

there’s rly no difference

C2. Bacteroidota ASVs

# Calculate top couple of ASVs 
# Make a list of phyla the top phyla 
top_bacter_ASVs <- 
  ASV_df %>%
  dplyr::filter(Phylum == "Bacteroidota") %>%
  group_by(ASV) %>%
  summarize(mean_Abundance = mean(Abundance)) %>%
  dplyr::filter(mean_Abundance > 0.005) %>%
  pull(ASV)

# Bacteroidota
# Plot ASVs 
bacter_asv_AgeRange <- 
  ASV_df %>%
  dplyr::filter(ASV %in% top_bacter_ASVs) %>%
  # build the plot 
  ggplot(aes(x = AgeRange, y = Abundance, 
             fill = AgeRange, color = AgeRange)) + 
  facet_wrap(Genus~ASV, scales = "free_y", nrow = 2) + 
  geom_boxplot(alpha = 0.5, outlier.shape = NA) + # outliers not plotted here in boxplot 
  geom_jitter() + 
  theme_bw() + 
  labs(title = "Bacteroidota ASVs") + 
  scale_color_manual(values = AgeRange_colors) + 
  scale_fill_manual(values = AgeRange_colors) + 
  theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
        legend.position = "none")

# Plot ASVs: Continuous 
bacter_asv_weight <- 
  ASV_df %>%
  dplyr::filter(ASV %in% top_bacter_ASVs) %>%
  # build the plot 
  ggplot(aes(x = Weight, y = Abundance)) + 
  facet_wrap(Genus~ASV, scales = "free_y", nrow = 2) + 
  geom_point(aes(color = AgeRange)) +  theme_bw() + 
  geom_smooth(method = "lm",  formula = y ~ poly(x, 2)) + 
  labs(title = "Bacillota ASVs") + 
  scale_color_manual(values = AgeRange_colors) + 
  theme(legend.position = "none")

# Collect the Bacteroidota Plots
bacter_asv_AgeRange / bacter_asv_weight + 
  plot_layout(guides = "collect") +
  plot_annotation(tag_levels = "A")

# choose specific ASVs to make graphs easier to see 
# Plot ASVs 
bacter_asv_AgeRange_filtered <- 
  ASV_df %>%
  dplyr::filter(ASV %in% top_bacter_ASVs) %>%
  dplyr::filter(SampleSite == "CLOACA") %>%
  dplyr::filter(ASV %in% c("ASV_0010", "ASV_0031", "ASV_0044", "ASV_0083", "ASV_0074")) %>%
  # build the plot 
  ggplot(aes(x = AgeRange, y = Abundance, 
             fill = AgeRange, color = AgeRange)) + 
  facet_wrap(Genus~ASV, scales = "free_y", nrow = 2) + 
  geom_boxplot(alpha = 0.5, outlier.shape = NA) + # outliers not plotted here in boxplot 
  geom_jitter() + 
  theme_bw() + 
  labs(title = "Bacteroidota ASVs") + 
  scale_color_manual(values = AgeRange_colors) + 
  scale_fill_manual(values = AgeRange_colors) + 
  theme(axis.text.x = element_text(angle = 30, hjust = 1, vjust = 1),
        legend.position = "none")

# Plot ASVs: Continuous 
bacter_asv_weight_filtered <- 
  ASV_df %>%
  dplyr::filter(ASV %in% top_bacter_ASVs) %>%
  dplyr::filter(SampleSite == "CLOACA") %>%
  dplyr::filter(ASV %in% c("ASV_0010", "ASV_0031", "ASV_0044", "ASV_0083", "ASV_0074")) %>%
  # build the plot 
  ggplot(aes(x = Weight, y = Abundance)) + 
  facet_wrap(Genus~ASV, scales = "free_y", nrow = 2) + 
  geom_point(aes(color = AgeRange)) +  theme_bw() + 
  geom_smooth(method = "lm",  formula = y ~ poly(x, 2)) + 
  labs(title = "Bacillota ASVs") + 
  scale_color_manual(values = AgeRange_colors) + 
  theme(legend.position = "none")

# Collect the Bacteroidota Plots
bacter_asv_AgeRange_filtered / bacter_asv_weight_filtered + 
  plot_layout(guides = "collect") +
  plot_annotation(tag_levels = "A")

ASV_0002 gained with age

ASV_0097 lost with age

ASV_0055 completely gone in adults

Potentially slight trend of losing ASVs with age

Interpretation 6: replace

Interpretation 7: replace

Interpretation 8: replace

Session Information

For reproducibility

devtools::session_info()
## ─ Session info ───────────────────────────────────────────────────────────────
##  setting  value
##  version  R version 4.2.3 (2023-03-15)
##  os       Rocky Linux 9.4 (Blue Onyx)
##  system   x86_64, linux-gnu
##  ui       X11
##  language (EN)
##  collate  en_US.UTF-8
##  ctype    en_US.UTF-8
##  tz       America/New_York
##  date     2025-04-29
##  pandoc   3.1.1 @ /usr/lib/rstudio-server/bin/quarto/bin/tools/ (via rmarkdown)
## 
## ─ Packages ───────────────────────────────────────────────────────────────────
##  package          * version    date (UTC) lib source
##  ade4               1.7-23     2025-02-14 [1] CRAN (R 4.2.3)
##  ape                5.8-1      2024-12-16 [1] CRAN (R 4.2.3)
##  Biobase            2.58.0     2022-11-01 [2] Bioconductor
##  BiocGenerics       0.44.0     2022-11-01 [2] Bioconductor
##  biomformat         1.26.0     2022-11-01 [1] Bioconductor
##  Biostrings         2.66.0     2022-11-01 [2] Bioconductor
##  bitops             1.0-9      2024-10-03 [2] CRAN (R 4.2.3)
##  bslib              0.8.0      2024-07-29 [2] CRAN (R 4.2.3)
##  cachem             1.1.0      2024-05-16 [2] CRAN (R 4.2.3)
##  cli                3.6.4      2025-02-13 [1] CRAN (R 4.2.3)
##  cluster            2.1.4      2022-08-22 [2] CRAN (R 4.2.3)
##  codetools          0.2-19     2023-02-01 [2] CRAN (R 4.2.3)
##  colorspace         2.1-1      2024-07-26 [2] CRAN (R 4.2.3)
##  crayon             1.5.3      2024-06-20 [2] CRAN (R 4.2.3)
##  crosstalk          1.2.1      2023-11-23 [2] CRAN (R 4.2.3)
##  data.table         1.16.2     2024-10-10 [2] CRAN (R 4.2.3)
##  devtools         * 2.4.5      2022-10-11 [1] CRAN (R 4.2.3)
##  digest             0.6.37     2024-08-19 [2] CRAN (R 4.2.3)
##  dplyr            * 1.1.4      2023-11-17 [2] CRAN (R 4.2.3)
##  DT               * 0.33       2024-04-04 [1] CRAN (R 4.2.3)
##  ellipsis           0.3.2      2021-04-29 [2] CRAN (R 4.2.3)
##  evaluate           1.0.1      2024-10-10 [2] CRAN (R 4.2.3)
##  farver             2.1.2      2024-05-13 [2] CRAN (R 4.2.3)
##  fastmap            1.2.0      2024-05-15 [2] CRAN (R 4.2.3)
##  forcats          * 1.0.0      2023-01-29 [2] CRAN (R 4.2.3)
##  foreach            1.5.2      2022-02-02 [1] CRAN (R 4.2.3)
##  fs                 1.6.5      2024-10-30 [2] CRAN (R 4.2.3)
##  generics           0.1.3      2022-07-05 [2] CRAN (R 4.2.3)
##  GenomeInfoDb       1.34.9     2023-02-02 [2] Bioconductor
##  GenomeInfoDbData   1.2.9      2024-11-25 [2] Bioconductor
##  ggplot2          * 3.5.1      2024-04-23 [2] CRAN (R 4.2.3)
##  glue               1.8.0      2024-09-30 [2] CRAN (R 4.2.3)
##  gtable             0.3.6      2024-10-25 [2] CRAN (R 4.2.3)
##  hms                1.1.3      2023-03-21 [2] CRAN (R 4.2.3)
##  htmltools          0.5.8.1    2024-04-04 [2] CRAN (R 4.2.3)
##  htmlwidgets        1.6.4      2023-12-06 [2] CRAN (R 4.2.3)
##  httpuv             1.6.15     2024-03-26 [2] CRAN (R 4.2.3)
##  igraph             2.1.1      2024-10-19 [2] CRAN (R 4.2.3)
##  IRanges            2.32.0     2022-11-01 [2] Bioconductor
##  iterators          1.0.14     2022-02-05 [1] CRAN (R 4.2.3)
##  jquerylib          0.1.4      2021-04-26 [2] CRAN (R 4.2.3)
##  jsonlite           1.8.9      2024-09-20 [2] CRAN (R 4.2.3)
##  knitr              1.49       2024-11-08 [2] CRAN (R 4.2.3)
##  labeling           0.4.3      2023-08-29 [2] CRAN (R 4.2.3)
##  later              1.3.2      2023-12-06 [2] CRAN (R 4.2.3)
##  lattice            0.20-45    2021-09-22 [2] CRAN (R 4.2.3)
##  lifecycle          1.0.4      2023-11-07 [2] CRAN (R 4.2.3)
##  lubridate        * 1.9.3      2023-09-27 [2] CRAN (R 4.2.3)
##  magrittr           2.0.3      2022-03-30 [2] CRAN (R 4.2.3)
##  MASS               7.3-58.2   2023-01-23 [2] CRAN (R 4.2.3)
##  Matrix             1.6-4      2023-11-30 [2] CRAN (R 4.2.3)
##  memoise            2.0.1      2021-11-26 [2] CRAN (R 4.2.3)
##  mgcv               1.8-42     2023-03-02 [2] CRAN (R 4.2.3)
##  mime               0.12       2021-09-28 [2] CRAN (R 4.2.3)
##  miniUI             0.1.1.1    2018-05-18 [2] CRAN (R 4.2.3)
##  multtest           2.54.0     2022-11-01 [1] Bioconductor
##  munsell            0.5.1      2024-04-01 [2] CRAN (R 4.2.3)
##  nlme               3.1-162    2023-01-31 [2] CRAN (R 4.2.3)
##  pacman             0.5.1      2019-03-11 [1] CRAN (R 4.2.3)
##  patchwork        * 1.3.0.9000 2025-02-19 [1] Github (thomasp85/patchwork@2695a9f)
##  permute            0.9-7      2022-01-27 [1] CRAN (R 4.2.3)
##  phyloseq         * 1.42.0     2022-11-01 [1] Bioconductor
##  pillar             1.10.1     2025-01-07 [1] CRAN (R 4.2.3)
##  pkgbuild           1.4.5      2024-10-28 [2] CRAN (R 4.2.3)
##  pkgconfig          2.0.3      2019-09-22 [2] CRAN (R 4.2.3)
##  pkgload            1.4.0      2024-06-28 [2] CRAN (R 4.2.3)
##  plyr               1.8.9      2023-10-02 [2] CRAN (R 4.2.3)
##  profvis            0.4.0      2024-09-20 [2] CRAN (R 4.2.3)
##  promises           1.3.0      2024-04-05 [2] CRAN (R 4.2.3)
##  purrr            * 1.0.2      2023-08-10 [2] CRAN (R 4.2.3)
##  R6                 2.6.1      2025-02-15 [1] CRAN (R 4.2.3)
##  Rcpp               1.0.13-1   2024-11-02 [2] CRAN (R 4.2.3)
##  RCurl              1.98-1.16  2024-07-11 [2] CRAN (R 4.2.3)
##  readr            * 2.1.5      2024-01-10 [2] CRAN (R 4.2.3)
##  remotes            2.5.0      2024-03-17 [2] CRAN (R 4.2.3)
##  reshape2           1.4.4      2020-04-09 [2] CRAN (R 4.2.3)
##  rhdf5              2.42.1     2023-04-07 [1] Bioconductor
##  rhdf5filters       1.10.1     2023-03-24 [1] Bioconductor
##  Rhdf5lib           1.20.0     2022-11-01 [1] Bioconductor
##  rlang              1.1.5      2025-01-17 [1] CRAN (R 4.2.3)
##  rmarkdown          2.29       2024-11-04 [2] CRAN (R 4.2.3)
##  rstudioapi         0.17.1     2024-10-22 [2] CRAN (R 4.2.3)
##  S4Vectors          0.36.2     2023-02-26 [2] Bioconductor
##  sass               0.4.9      2024-03-15 [2] CRAN (R 4.2.3)
##  scales             1.3.0      2023-11-28 [2] CRAN (R 4.2.3)
##  sessioninfo        1.2.2      2021-12-06 [2] CRAN (R 4.2.3)
##  shiny              1.9.1      2024-08-01 [2] CRAN (R 4.2.3)
##  stringi            1.8.4      2024-05-06 [2] CRAN (R 4.2.3)
##  stringr          * 1.5.1      2023-11-14 [2] CRAN (R 4.2.3)
##  survival           3.5-3      2023-02-12 [2] CRAN (R 4.2.3)
##  tibble           * 3.2.1      2023-03-20 [2] CRAN (R 4.2.3)
##  tidyr            * 1.3.1      2024-01-24 [2] CRAN (R 4.2.3)
##  tidyselect         1.2.1      2024-03-11 [2] CRAN (R 4.2.3)
##  tidyverse        * 2.0.0      2023-02-22 [2] CRAN (R 4.2.3)
##  timechange         0.3.0      2024-01-18 [2] CRAN (R 4.2.3)
##  tzdb               0.4.0      2023-05-12 [2] CRAN (R 4.2.3)
##  urlchecker         1.0.1      2021-11-30 [2] CRAN (R 4.2.3)
##  usethis          * 3.0.0      2024-07-29 [2] CRAN (R 4.2.3)
##  vctrs              0.6.5      2023-12-01 [2] CRAN (R 4.2.3)
##  vegan              2.6-10     2025-01-29 [1] CRAN (R 4.2.3)
##  withr              3.0.2      2024-10-28 [2] CRAN (R 4.2.3)
##  xfun               0.49       2024-10-31 [2] CRAN (R 4.2.3)
##  xtable             1.8-4      2019-04-21 [2] CRAN (R 4.2.3)
##  XVector            0.38.0     2022-11-01 [2] Bioconductor
##  yaml               2.3.10     2024-07-26 [2] CRAN (R 4.2.3)
##  zlibbioc           1.44.0     2022-11-01 [2] Bioconductor
## 
##  [1] /home/jt698/R/x86_64-pc-linux-gnu-library/4.2
##  [2] /programs/R-4.2.3/lib64/R/library
## 
## ──────────────────────────────────────────────────────────────────────────────